Day 14 - Regular expressions - Classes

I sat next to you in Mrs. Walsh’s English class!

Groundhog Day (1993)

You were so amazed by the power of regular expressions that you decided to come back and proceed

with your education! What did you say? I see, your boss forced you to learn regular expressions but

your real dream is to be an action film star. You should consider taking some acting classes. Speaking

of which, the topic of this lesson is classes in regular expressions, which are not evening courses,

but collections of characters.

In the previous chapter we learned how to match a single specific character in a regular expression

and how to match any single character. These two choices are very handy but often we need to

match a set of characters, for example the numbers 1, 2, or 3, or the letters between “a” and “f”.

Neither of the two syntaxes we discussed last time can provide this sort of match, so we need a new

one.

In a regular expression, the syntax [<characters>] means “any single character in the list”, and it

is exactly what we need in this case. For example

$ cat examples.txt | grep -E "[abc]"

matches a single “a”, a single “b”, or a single “c”. Remember that grep highlights all matching

elements in each line and prints the whole line; use the -o option if you want to print the matching

part only. The line “dog”, for example, is excluded from the output as it doesn’t contain any of the

three letters in the class.
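
To see the difference between the default output and -o, here is a minimal sketch. It builds a throwaway sample file (the file and its three lines are made up for illustration and are not the examples.txt used in this book) and runs the same class against it twice.

```shell
# Create a small sample file (hypothetical contents, for illustration only)
sample=$(mktemp)
printf 'dog\ncat\nbee\n' > "$sample"

# Print every line that contains at least one of a, b, or c
lines=$(cat "$sample" | grep -E "[abc]")
echo "$lines"

# Print only the matching parts, one match per line
parts=$(cat "$sample" | grep -E -o "[abc]")
echo "$parts"

# Clean up the temporary file
rm -f "$sample"
```

The first command should print “cat” and “bee” but not “dog”, while the second prints each matched letter on its own line.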

Classes are especially useful because they allow you to use ranges. For example

$ cat examples.txt | grep -E "[a-z]"

matches any lowercase letter of the English alphabet. This will highlight whole words like “gorilla”

and “aardvark”, as they are composed of lowercase letters only. “Johnny 5”, instead, is not completely

highlighted, as the capital “J” and the number 5 are not matched by the regular expression.
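
If you want to check which characters of “Johnny 5” the range actually matches, a quick sketch is to feed that single line to grep with -o. Setting LC_ALL=C here is my own assumption, added so that the range is read in plain ASCII order regardless of your system locale.

```shell
# Print only the characters matched by the lowercase range, one per line.
# LC_ALL=C is an assumption: it forces plain ASCII ordering for the range.
matched=$(echo "Johnny 5" | LC_ALL=C grep -E -o "[a-z]")
echo "$matched"
```

The capital “J”, the space, and the digit do not appear in the output, only the five lowercase letters do.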

If you use regular expressions in an editor to search for strings (I'll let you discover how your favourite

editor allows you to do this) the syntax [a-z] will match the first lowercase letter in the text.

Repeating the search will find the second one, and so on.

Typical ranges are a-z for lowercase letters, A-Z for uppercase ones, and 0-9 for digits. You can use

more than one range in a class, for example